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Abstract 

The overlap operator in lattice QCD requires the computation of the sign function of a matrix, which is non-Hermitian 
in the presence of a quark chemical potential. In previous work we introduced an Arnoldi-based Krylov subspace 
approximation, which uses long recurrences. Even after the deflation of critical eigenvalues, the low efficiency of 
the method restricts its application to small lattices. Here we propose new short-recurrence methods which strongly 
enhance the efficiency of the computational method. Using rational approximations to the sign function we introduce 
two variants, based on the restarted Amoldi process and on the two-sided Lanczos method, respectively, which become 
very efficient when combined with multishift solvers. Alternatively, in the variant based on the two-sided Lanczos 
method the sign function can be evaluated directly. We present numerical results which compare the efficiencies of 
a restarted Arnoldi-based method and the direct two-sided Lanczos approximation for various lattice sizes. We also 
show that our new methods gain substantially when combined with deflation. 



1. Introduction 

While this paper discusses new numerical methods that are expected to be useful in a large number of applications, 
the main motivation for these new methods comes from quantum chromodynamics (QCD) formulated on a discrete 
space-time lattice. QCD is the fundamental theory of the strong interaction. Being a non-Abelian gauge theory, 
it is notoriously difficult to deal with. Lattice QCD is the only systematic non-perturbative approach to compute 
observables from the theory, and it is amenable to numerical simulations. The main object relevant for our discussion 
is the Dirac operator, for which there exist several formulations that differ on the lattice but are supposed to give the 
same continuum limit when the lattice spacing is taken to zero. We are focusing on the overlap Dirac operator D^^ 
im 121, which is the cleanest formulation in terms of lattice chiral symmetry ||3] |3| but very expensive in terms of 
the numerical effort it requires. Trying to improve algorithms dealing with the overlap operator is an active field of 
research, and even small improvements can have an impact on the large-scale lattice simulations that are being run by 
the lattice QCD collaborations worldwide. 

The overlap operator is essentially given by the sign function of its kernel, which we assume is the usual Hermitian 
Wilson operator H^r = ysD^Y (see |5J for the notation). On the lattice, this operator is represented by a sparse matrix, 
and on current production lattices the dimension of this matrix can be as large as 10** ~ 10^. The main numerical effort 
lies in the inversion of the overlap operator, which is done by iterative methods and requires the repeated application of 
the sign function of Hw on a vector At zero chemical potential //, Hw is Hermitian, and many sophisticated methods 
have been developed for this case (see, e.g., |6:|). However, one can also study QCD at nonzero quark chemical 
potential (or, equivalently, density), which is relevant for many physical systems such as neutron stars, relativistic 
heavy ion collisions, or the physics of the early universe. The overlap operator has been generalized to this case 
lEEl. While the result is formally similar to the one at yu = 0, it is in fact more complicated since Hw becomes a 
non-Hermitian matrix, of which we need to compute the sign function. This case is much less studied and the focus of 
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the present paper, which is a natural continuation of earlier work ||8]. For simplicity we will still refer to Hw - jsDw 
as the "Hermitian" Wilson operator. 

In mathematical terms, we investigate the computation of f{A)b, where A e C"^" is non-Hermitian and / is a 
general function defined on the spectrum of A such that the extension of / to matrix arguments is defined. For a simple 
definition of matrix functions we assume that A is diagonalizable and let A - RAR^^ be the eigendecomposition with 
R e C"^" and diagonal A containing the eigenvalues Ai, . . . , A„ e C. Then the matrix evaluation of / is defined as 

f(A) = Rf(A)R-' = Rdmg(f(Ai), . . .,f(A„))R-' . (1) 

Accordingly, if b = Ry e C" is a vector expressed in terms of the eigenvectors, then 

f{A)b^Rf(A)y. (2) 

For a thorough treatment of matrix functions see |9|; a compact overview is given in 1 10|. The case / = sign will be 
of particular interest. We use the standard definition signz = signRe(z) for z e C [9|, which in the physics case we 
are considering was also shown to follow from the domain-wall formalism Q. 

If A is large and sparse, f(A) is too costly to compute, whereas f(A)b can still be obtained in an efficient manner 
via a Krylov subspace method. 

The foundation for any Krylov subspace method is the computation of an appropriate basis for the Krylov subspace 
Kit{A, b) = span{b,Ab, . . . , A*"'^?). For Hermitian matrices an orthonormal basis can be built with short recurrences 
using the Lanczos process. For non-Hermitian matrices the corresponding process, which again computes an or- 
thonormal basis, is known as the Arnoldi process. It requires long recurrences and is usually summarized via the 
Arnoldi relation 

AVk = VkHk + hk+x,kVk+\el . (3) 

Here, Vk - ■ ■ ■ Iv^] e C"^*^ is the matrix which contains the computed basis vectors (the Arnoldi vectors), Hk - 
Vk^AVk is the upper Hessenberg matrix containing the recurrence coefficients hij, and ek denotes the k-th unit vector 
of Hk being upper Hessenberg reflects the fact that the computation of the next Arnoldi vector v^+i results in a 
long recurrence since the projection of Avk on all previous vectors vi, . . . ,Vk has to be subtracted from Avk- Long 
recurrences slow down computation and increase storage requirements, and thus become inefficient or even infeasible 
if k, the dimension of the Krylov subspace, becomes large. This is the reason why in this paper we investigate two ways 
to circumvent this problem for non-Hermitian matrices, i.e., restarts of the Arnoldi process and the use of the two- 
sided Lanczos process. We will consider these two methods in combination with a rational function approximation to 
/. In the case of two-sided Lanczos, we will also consider a direct evaluation of the function. 

This paper is organized as follows. In Section |2] we describe the two alternatives just mentioned to obtain short 
recurrences. In Section [3] we address several aspects of the important issue of deflation. Section [4] contains the 
descriptions of four different short-recurrence algorithms to compute the sign function, all of which use the preferred 
method of LR deflation. In Section[5]we discuss the choice of the rational function to approximate the sign function. 
Our numerical results are presented in Section|6] and conclusions are drawn in Section]?] 

2. Short recurrences for non-Hermitian matrices 

For simplicity, we assume \\b\\ = 1 from now on. The standard Krylov subspace approach, introduced in ifTTl (see 
also ||9l), to obtain approximations to the action f{A)b is to compute 

f{A)b ^ VkVk^f{A)b = VkVk^ fiAWkex ^ Vkf{Hk)e, (4) 

for some suitable k n. Here, ei denotes the first unit vector. We refer to |l8][T2l for a discussion in the context of 
the overlap operator at nonzero chemical potential. 

The approximation Q can be viewed as a projection approach. The operator A is orthogonally projected onto 
Kk{A,b), the projection being represented by Hk - V^'AV^. We then compute f{Hk)e\, i.e., we evaluate the matrix 
function of / for the projected operator, applied to the projected vector ei - Vt^b. This result is finally lifted back 
to the larger, original space by multiplication with Vk- The matrix function /(Hk), where Hk is of small size, can be 
evaluated using existing schemes for matrix functions, e.g., by computing the eigendecomposition of Hk or by using 
iterative schemes like, in the case of / = sign, Roberts' iterative scheme based on Newton's method, see, e.g., [9|. 
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2.1. Restarting the Arnoldi process 

To prevent recurrences from becoming too long for Q one could, in principle, use a restart procedure. This means 
that one stops the Arnoldi process after k^^y^ iterations. At this point we have a, possibly crude, approximation Q to 
f(A)b, and to allow for a restart one now has to express the error of this approximation anew as the action of a matrix 
function, fi(A)bi, say. It turns out that this can indeed be done, see fl3l . at least in theory, with /i defined as a divided 
difference of / with respect to the eigenvalues of H/i^^^,^ and with bi = vj^^^,^^, the last Arnoldi vector of the previous step. 
Unfortunately, however, this may result in a numerically unstable process, so that after a few restarts the numerical 
results become useless. For details, see lfT3]| . 

An important exception arises when / is a rational function of the form 



s 

(Jjj 



We then have 

s 

/(A)/7 = (6) 

1=1 

where the x^''\ i - I,. . . ,s, are solutions of the s shifted systems 

(A - cr,/„)x<'> = b (7) 

and /„ is the n x n unit matrix (we will frequently suppress the index on /). For A large and sparse, these shifted 
systems cannot be solved efficiently by direct methods. Using the Arnoldi projection approach outlined before, the 
current approximation for f{A)b is obtained as 

s 

= ^ oJixf with xf = Vk{Hk - o-ihT^ei , i^\,...,s. (8) 

/=i 

Note that Krylov subspaces are shift invariant, i.e., Kk{A, b) — Kk{A — o-J, b), and that the Arnoldi process applied 
to A - cr,/ instead of A produces the same set of Arnoldi vectors, i.e., the same matrices Vk with Hk replaced by 
the shifted counterpart Hk - cr,/, see lfT4l [TSj . This shows that the vectors x['' in ([8| are the iterates of the full 
orthogonalization method FOM, see for the linear systems 

(A - cr,/)x ^b. (9) 

A crucial observation is that for any k the individual residuals r['^ - b - (A - o-iI)x^^^ of the FOM iterates are just 
scalar multiples of the Arnoldi vector Vk+i, see, e.g., ifTTlfTSl . i.e., 

rf=p<Vi, i^h...,s, (10) 

with collinearity factors p*'* e C. The error ek - f(A)b - xj of the approximation at step k can therefore be expressed 
as 

(0 



ek = /i(A)v*+i , where /i(f) = V . (11) 



This allows for a simple restart at step k^ji-^^ of the Arnoldi process, with the new function fi again being rational 
with the same poles as /. For this reason, the stability problems that are usually encountered with restarts for general 
functions / do not occur here. 

The restart process just described can also be regarded as performing restarted FOM for each of the individual 
systems (A - cr,7)x = b, i - I,. . .,s (and combining the individual iterates appropriately), the point being that, 
even after a restart, we need only a single Krylov subspace for all s systems, see fW\. Restarted FOM is not the only 
"multishift" solver based on a single Krylov subspace to compute approximations to fiA)b by combining approximate 
solutions to (A - cr,/)x = b. An important alternative to FOM is to use restarted GMRES for families of shifted linear 
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systems as presented in lfT9l . This method also relies on the restarted Arnoldi process, but now a difference has to 
be made between the seed system, for which "true" restarted GMRES is performed, and the other systems, for which 
a variant of GMRES is performed which keeps the residuals collinear to that of the seed system. The convergence 
analysis in |fT9l shows that this approach is justified if A is positive real (i.e., Re(ji: ' Ajc) > for all x i^Q) and all shifts 
are negative}^ Indeed then, taking as the seed system the one belonging to the shift which is smallest in modulus, ctj 
say, the residuals of all the other systems — for which we do not perform "true" GMRES — are smaller in norm than 
the residual for cri . But for the first system we do perform true restarted GMRES which is known to converge under 
the assumptions made. 

2.2. The two-sided Lanczos process 

Another way to obtain short recurrences when computing a basis for the Krylov subspaces for non-Hermitian 
matrices is to replace the Arnoldi process by the two-sided Lanczos process. The two-sided Lanczos process builds 
two biorthogonal bases V] , . . . , v;; and w\,. . . , Wk for the two Krylov subspaces Kk(A, b) and Kk{A\b), respectively. 
Here, S is a so-called shadow vector which can be chosen arbitrarily. We always chose b - b motivated by the fact 
that then for — > one recovers the standard Lanczos method for which the projection on the Krylov subspace (see 
( fl4l ) below) is orthogonal. With Vu - [vi| ■ ■ ■ \vk\ and Wk - [wi| ■ ■ ■ \wk\ we thus have Vk'^Wk - h, and the resulting 
recurrences can be summarized as 

AVk = VkHk + hk+x,kVk+\el , (12) 
A^Wk = WkHk^ + hkMi^k^xel , (13) 

where Hk - wj^AVk is tridiagonal. Note that an iteration will now require two matrix-vector multiplications, one by 
A and one by A^ In principle, the choice of b can substantially influence the two-sided Lanczos process, which can 
even break down prematurely or run into numerical instabilities. With our choice ofb-b such undesirable behavior 
never occurred in our numerical experiments. 

The matrix VkWk'' now represents an oblique projection, and in analogy to Q we get the approximations 

f{A)b ^ VkWk^fiA)b = VkWk^f(A)Vke, * Vkf(Hk)ei. (14) 

A first report on the application of ( [T4] l to the overlap operator with chemical potential can be found in 11211 . 
If / is a rational function, see (|5]l, the approximation ( [T4] l can be expressed as 

s 

mb^Yj'^ixf WAhxf ^VkiHk-cTiiy'e,. (15) 
/=i 

Since, just as the Arnoldi process, the two-sided Lanczos process creates the same vectors Vk,Wk if one passes 
from A to A - cr,7 with the projected matrix Hk passing to Hk - cr,7, this shows that for all / the vectors x^'' are just 
the BiCG iterates for the systems (A - cril)x - b. In other words: If / is a rational function, the approximation ( [T4| l is 
equivalent to performing "multishift" BiCG, see lfT7l l22i (and recombining the individual iterates xj,'' as ti^''). 
Although no breakdowns were observed in the numerical experiments of Ref. Il20l . for reasons of numerical stability 
one might prefer using the BiCGStab |23| or QMR method 1241 instead of BiCG. Both also rely on the two-sided 
Lanczos process, and efficient multishift versions exist as well, see 107112211251 . 

2.3. Summary and comparison 

To summarize, so far we have presented the following approaches to developing short-recurrence methods to 
iteratively approximate f(A)b: 

1. Methods based on restarted Arnoldi: 



' In our case, A = is positive real if all the eigenvalues of Hw have their angles in(-^,j)U(^,^), which will be true if /i is sufficiently 
small. Expeiimentally, even for larger values of ji, we did not encounter any convergence problems in numerical expeiiments 1201 . 
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a) Approximate / by a rational function g. Then use multishift restarted FOM or multishift restarted GMRES 
^oxg{A)b. 



b) Apply the restarted Arnoldi process directly. As discussed at the beginning of Section 2.1 this is not 
possible in computational practice because of stability problems. 

2. Methods based on two-sided Lanczos: 

a) Approximate / by a rational function g. Then use multishift BiCG/BiCGStab/QMR for g{A)b. 

b) Use directly the approximation Vkf{Hk)e\ to the obUque projection VkWk^ f(_A)b for any /, see ([T4|. 

The corresponding algorithms will be given in Section]?] 

Note that short recurrences, in principle, result in constant work per iteration. However, for approach 2b) we will 
have to evaluate f(Hk) for a k x k matrix Ht, and this work will become substantial if k is large, see Proposition [2] 
belowj^j Also, in approach 2b) we have to store all vectors v'l , V2, . . . which may become prohibitive so that a two-pass 
strategy may be mandatory: The two-sided Lanczos process is run twice. In the first run, Hk is built up, but the vectors 
Vk, Wk are discarded. Once = f{Hk)e\ has been computed, the Lanczos process is run again, and the vectors are 
combined with the coefficients from jk to obtain the final approximation. Both of these drawbacks are not present in 
the other approaches. These, however, rely on the fact that we must be able to replace the computation of f{A)b by 
g{A)b with a sufficiently precise rational approximation g to /. 



3. Deflation 

In [8] two approaches to deflate eigenvectors were proposed for the Krylov subspace approximation (|4]). These 
deflation techniques use eigenvalue information, namely Schur vectors (Schur deflation) or left and right eigenvectors 
(LR deflation) corresponding to some "critical" eigenvalues. Critical eigenvalues are those which are close to a 
singularity of / since, if these are not reflected very precisely in the Krylov subspace, we get a poor approximation. 
In case of the sign function the critical eigenvalues are those close to the imaginary axis. In this section we describe 
both deflation methods and show how they can be combined with multishifts so that they can be used in approaches 
based on a rational approximation. We point out a serious disadvantage of Schur deflation, leaving LR deflation as the 
method of choice. For the sake of simplicity we present the deflation techniques without taking restarts into account. 
We will briefly comment on restarts after (]24]| below. 

We start with Schur deflation. Let S m - {s\\ ■ ■ ■ \s,„} be the matrix whose columns Sj are the Schur vectors of m 
critical eigenvalues of the matrix A. This means that we have S „,^S ,„ - !„ and 

AS,„=S,„T„,, (16) 

where T„, is an upper triangular matrix with the m critical eigenvalues of A on its diagonal, see fTl\ . Let us note that the 
Schur vectors span an invariant subspace of A, and that they can be computed via orthogonal transformations, which 
is very stable numerically. The extraction of the eigenvectors themselves is a less stable process if A is non-Hermitian. 
In the case of the shifted matrices A - cr,/, i - 1,. . .,s, and S„,, r„, computed with respect to A we have 

(A-o-iI)S,n^AS,n-o-iS,„^S,„(T„,-o-iI„), /=l,...,i. (17) 

Clearly, the matrix P = SmS,,,^ represents the orthogonal projector onto the subspace Q,„ - span{ii, . . . , s^}. The 
solutions to (]7]l are now approximated in augmented Krylov subspaces, 

xf en,„ + (I-P)K,(A,b). (18) 

In fact, the projected Krylov subspace (I - P)Kk(A, b), which is orthogonal to f2,„, is a Krylov subspace again, but now 
for (/ - P)A instead of A and (/ - P)b instead of b: Since Q.^ = range(P) is A-invariant, i.e., for any y there is a y such 
that APy - Py, we have 

(/ - P)A(I - P)y = (/ - P)Ay -{I- P)APy = (/ - P)Ay -{I- P)Py = (/ - P)Ay (19) 



^For the special case of / = sign, this problem is alleviated by a new method |26| that speeds up the evaluation of /(//t), thereby eliminating 
the 0{k^) term in Proposition|2] 
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and thus 



(/ - P)Kk{A, b) = span{(/ - P)b, (I - P)Ab, ...,(/- P)A*"'fe) 

= span{(/ - P)b, (I - P)A(I - P)b, ...,((/- P)A)''-Hl - P)b} 

^ Ku{{I - P)A,{I - P)b) . (20) 

To build a basis Vk - [vi| ■ ■ ■ |v^] for this Krylov subspace we have to multiply by (/ - P)A instead of A in every step, 
reflecting the fact that we have to project out the Q,„-part after every multiplication by A. This may result in quite 
considerable computational work: The work for one projection has cost 0{nm), because each of the m Schur vectors 
is usually non-sparse. 

We now turn to LR deflation. The idea is essentially the same as for Schur deflation, except that we use a different 
projector. As we will see below, this has a useful consequence: It removes the need to multiply by / - P in every 
step. Thus the 0{nm) effort for the projection step has to be paid only once, instead of once per iteration (but see the 
comment after Eq. (|22]i). 

The projector we use is an oblique projector onto Q„,, defined by P = RmLm, where R,„ - [ri | ■ ■ ■ |r„,] is the matrix 
containing the right eigenvectors and = [^il ■ ■ ■ \lm] ' is the matrix containing the left eigenvectors corresponding to 
the m critical eigenvalues of A. With A,„ the diagonal matrix with the m critical eigenvalues on its diagonal, the left 
and right eigenvectors satisfy 

AR„,^R,„A,n and = Am4 ■ (21) 

The left and right eigenvectors are biorthogonal and are normalized such that lI,R,„ - /„,, thus ensuring P^ - P. 

As in the Schur deflation the projected Krylov subspace (/ - P)Kk(A, b) is a Krylov subspace. It is no longer 
orthogonal to Qm because the projector is oblique, but it now is a Krylov subspace for the original matrix A since both 
range(f ) and range(/ - P) are A-invariant so that (/ - P)Ay - Ay for y e range(/ - P). Instead of (pO]i we now have 



(/ - P)Kt{A, b) = KkiA, (I - P)b) . (22) 

Therefore, no additional projection is needed within the Arnoldi method when we build up a basis Vk = [v'i| ■ ■ ■ Iv^] 
for this subspace. In computational practice, however, components outside of range(/ - P) will show up gradually 
due to rounding effects in floating-point arithmetic. It is thus necessary to apply (/ - P) from time to time in order to 



eliminate these components. We will come back to this point in Section 4.1 

The numerical accuracy of the computed eigenvectors turned out to always be sufficient in our computations. 
Therefore, because of its greater efficiency, from now on we concentrate on LR rather than Schur deflation. 

The overall approach is thus as follows: With the oblique projector P - RmL^m we split f(A)b into the two parts 

f(A)b^f(A)(Pb) + f(A)(I-P)b. (23) 

Since we know the left and right eigenvectors which make up P, using (|2]) we directly obtain 

xp = f(A)(Pb) = f(A)R,nLlb = R,nf(A,„)(Lib) . (24) 

The remaining part f(A)(I - P)b can then be approximated iteratively by any of the approaches discussed in Section|2] 
Since the only effect of LR deflation is the replacement of bhy {I - P)b in Q, no modifications are necessary when 
using one of the restarted approaches. 

There is a beneficial effect of deff ation on the number of poles to use when / is approximated by a rational function 
g. Let V be the coefficient vector of b when represented in the basis of right eigenvectors of A, i.e., b - Ry, and assume 
that we sorted them to put the critical eigenvectors first, i.e.. 



R^{R„AR^„r], y = 



ym 



A = 



Am 




(25) 



Then f(A)Pb - Rmf(\n)ym and f(A)(I - P)b — R^,„f(A^„,)y^„,. So when approximating f(A)(I - P)b via a rational 
function g, we have f{A){I - P)b ^ g{A){I - P)b - R^mg(A^„,)y^„,. This shows that we only have to take care that 
g approximates / well on the non-critical eigenvalues (those in A^^)- Consequently, a good approximation can be 
obtained using a smaller number of poles as compared to the situation where we would have to approximate well 
on the full spectrum of A. This pole-reduction phenomenon can be very substantial, even if we deflate only a small 
number of eigenvalues, see Section[6] 
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4. Algorithms 



In this section we present four algorithms, corresponding to the Ust in Section 2.3 to compute the action ( |23) 
of the sign function of a non-Hermitian matrix on a vector, using LR deflation for the first term f{A){Pb) and short- 
recurrence Krylov subspace methods for the remaining term f(A)(I - P)b. 



4.1. Restarted Arnoldi with rational functions 

In this subsection we discuss methods based on restarted Arnoldi, corresponding to la) in Section [23] We assume 
that the original function / is replaced by a rational function given by (|5]l which approximates the original function 
sufficiently well. The choice of the rational function will be discussed in Section|5] 

We start with LR-deflated restarted FOM. The resulting algorithm is given as Algorithm [T] where we use ([8| to 
obtain the iterates for all shifted systems in the current cycle, and where we give the details on how to obtain the 
collinearity factors p^'* from ( fTO] ) for the residuals, see also fTSl . Here, the notation FOM-LR(m, A;) indicates that 
we LR-deflate a subspace of dimension m and that we restart FOM after a cycle of k iterations. The vector x is the 
approximation to f(A)b. After the completion of each cycle we perform a projection step to eliminate numerical 
contamination by components outside of range(/ - P), as discussed in Section|3]after ( |22] i. 

Algorithm 1. Restarted FOM-LR(m, k) 

{Input m, k = ^max, A, {cTi, . . . , cr,), {wi, . . . , Wj), b,L^ L„„ R = R^, A = A,„) 

X = x/> = Rf{A)L^b 
r^(I- P)b 

while not all systems are converged do {loop over restart cycles] 
yS=||r||2 
VI = rip 

compute Vk, by running k steps of Arnoldi with A 
for / = 1, do 

yf ^l3p^'\Ht-(Tih)-'e, 
end for 

r = Vk+\ 

p^'^ = -hk+itelyf, / = 1, . . . , i 
r - (I - P)r [projection step] 
end while 



Since Algorithm [T] will be used in our numerical experiments, we now analyze the main contributions to its 
computational cost. 

Proposition 1. Let C„ denote the cost for one matrix-vector multiplication by the matrix A, and let ^tot be the total 
number of such matrix-vector multiplications performed. The computational cost of Algorithm^is given as 

^tot[c„ + n [0(^„ax) + 0{mlk^^^)]\ . (26) 

To see this, let us discuss the dominating contributions to the computational cost in one sweep of the while-loop. 
For simplicity we write k instead of k^^j^, as we also did in the algorithm. Computing V/, and with the Arnoldi 
process has cost kC„ + Oink^), since for j - \,...,k the y-th step requires one matrix- vector multiplication and 
j inner products, vector additions and scalings. Since we can solve systems with the upper Hessenberg matrices 
Hk - cr,4 with cost O(k^), the total cost for the computation of all the vectors y^'* is O(sk^), which can be neglected 
compared to the Oink^) cost contained in the Arnoldi process. Updating x with the linear combination of the columns 
of Vk has cost 0(ks + nk), which can again be neglected compared to the Oink^) cost in the Arnoldi process. The 
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final projection step has cost 0(mn). Multiplying these costs by the number «sweep of sweeps through the while-loop 
and using ^tot = "sweep^max gives the total cost. The initial steps prior to the while loop have cost 0{mri), which is 
dominated by the last term of Eq. ( p6| ). 

We now formulate the LR-deflated restarted GMRES algorithm. Let us first introduce the {k+ Y)xk matrix 



(27) 



through which the Arnoldi relation ([3]) can be summarized as AVk - Vk+\Hk- We choose the first system (with shift 
(Ti) to be the seed system, i.e., the system for which we run "true" restarted GMRES. This implies that we have to 
solve a least squares problem involving Hk - 0"i4 to get the corresponding iterate. Here the matrix 4 denotes the 
fc-dimensional identity matrix extended with an extra row of zeros. For the other shifts cr2, . . . , cTj we impose the 
collinearity constraint for the residuals. The corresponding iterates are now obtained via solutions of linear systems. 
For a detailed derivation we refer to fT9l, and the detailed algorithmic formulation is given in Algorithm|2] 



Algorithm 2. Restarted GMRES-LR(m, k) 
{Input m, k - fcmax, A, {cr,, . . . ,0-5), {wi, . 



, Us], b,L- L,„, R - R,„, A = A„ 



x = xp^ Rf{K)L^b 
r^(I- P)b 

= l,/ = 2,...,s 
y6 = IWl2 

while not all systems are converged do {loop over restart cycles] 
v'l = rip 

compute Vk, Hk by running k steps of Arnoldi for A 
compute 37^'' as the minimizer of ||j8ei - {Hk - o-\Ik)y\\2 
for i = 2, . . . , 5 do 



compute y^' and p^'' as the solution of the {k+ l)x(^+ 1) system ^Hk - a-Jk 



„('■) 



end for 

x + VkYj'=\ ^iyl 

r = r - Vk+i{Hk -a-ilk)y\ 

p«=p«,.- = 2,...,. 
r = (I — P)r {projection step] 
end while 



4.2. Two-sided Lanczos with rational functions 

We now turn to methods based on two-sided Lanczos, corresponding to 2a) in Section |2.3| In this case there is 
no need for restarts because the two-sided Lanczos process uses only short recurrences anyway. We summarize a 
high-level view of the resulting computational method using multishift BiCG as Algorithm[3] The changes necessary 
to obtain multishift BiCGStab/QMR should be obvious. 

4.3. Direct application of the two-sided Lanczos approach 

We now consider the two-sided Lanczos approach for f{A){I - P)b as given in ( [T4| l, corresponding to 2b) in 
Section 2.3 The resulting computational method is summarized as Algorithm]?] Note that due to the deflation 



this algorithm uses a modified shadow vector: We remove from b all critical eigenvector components belonging to 
the right eigenvectors of A\ i.e., the left eigenvectors of A. With this modified shadow vector, the biorthogonality 
relation enforced by the two-sided Lanczos process numerically helps keeping the computed basis for K{A, r) free of 
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Algorithm 3. BiCG-LR(m) 

{Input m. A, {cTi, . . . ,0-5), {coi,. . .,a>s], b,L- L„„ R - R„„ A = A,,,) 

xp = Rf(K)L^b 
r = {I- P)b 

for k - 1,2,... until all systems are converged do 

compute the k-i\\ BiCG iterates xf, i - 1, . . . , s, for the systems (A - cril)x^'^ - r 
end for 



Algorithm 4. Direct two-sided Lanczos-LR(TO, k) 
{Input m, k, A,b,L- L,„, R = R,„, A - A,„) 

Xp = Rf{A)L^b 
r^b- RL^b 

f — b - LR^b {the modified shadow vector] 

put v'l - r,/3 - \\r\\2, choose wi - r and normalize s.t. vjwi = 1 

for j = l,2,...,fedo 

update Hj, compute v^+i and wj+i from the two-sided Lanczos process ( [T2| , ( [T3| l 
end for 

put Xk^xp+p- Vkf(Hk)ei 



contributions from the right critical eigenvectors, as it should be in exact arithmetic. In Algorithm|4j the parameter m 
denotes the number of deflated eigenvalues, and k is the maximum dimension of the Krylov subspace being built, a 
parameter which has to be fixed a priori. 

We now analyze the main contributions to the computational cost of Algorithm]?] which will also be used in our 
numerical tests. 

Proposition 2. Let M„ denote the cost for one matrix-vector multiplication by the matrix A, and let A;,^ be the total 
number of iterations performed, i.e., k^oi — kfrom Algorithm^ The computational cost of Algorithm^is given as 

2fetotM„ + Oikiotn) + 0(mn) + 0{ki^). (28) 

To see this, we discuss the dominating contributions to the computational cost as we did for Algorithm ]T] The 
initialization phase has cost 0(mn), since R,L e C"^'". In each sweep through the for-loop, updating Hj and the 
Lanczos vectors has cost 2M„ + 0(n), which gives a total of 2ktotM„ + <9(fetotn)- The last line of the algorithm requires 
operations to compute f{Hk,^^) and additional 0{kioin) operations to get .^^t,,,. 

5. Choice of the rational function 

In this section we address the issue of how to find good rational approximations to the sign function in the non- 
Hermitian case. 

In the Hermitian case, if we know intervals {-b, -a], [a, b} which contain the (deflated) spectrum of A, the sign 
function of A can be approximated using the Zolotarev best rational approximation, see ll28l and, e.g., Il29l l6l. Using 
the Zolotarev approximation on non-Hermitian matrices gives rather poor results, unless all eigenvalues are close 
to the real axis (see the left plot in Figure JT}. A better choice for generic non-Hermitian matrices is the rational 
approximation originally suggested by Kenney and Laub BOl and used by Neuberger Il3n[32l for vanishing chemical 
potential, 

(t + l)2s _ (f _ 1)2-5 

sign(0 . gAct) , where gAt) = |^ ^ ^ _ - (29) 
9 




Real{t) Real(t) 



Figure 1: Error of the Zolotarev rational approximation (left) and the Neuberger rational approximation (right). Both rational approximations are 
of the form gjt) = 1 (^i/{t^ - o";) with o"; < 0. We took i = 10 in both cases and plotted the contours for log|Q(gio(f) - sign(f)). We chose 
a = \ and b = 10 for the Zolotarev approximation, and c = I / VlO for the Neuberger approximation. The white spots on the imaginary axis mark 
the poles / = ±i '\/-(Ti of the rational approximation that lie in the interval /[- 10, 10]. 



Note that g,(t) = ^^(1 /f), and g,(t) = tanh (2s atanh f) for 

1 



.(0 



with LOi 



■ cos 



< 1 . The partial fraction expansion of gs is known to be 

(30) 



•tan^ 



see BOl [3n . In ( |29] l, c > is a parameter which one chooses to minimize the number of poles s of the partial 
fraction expansion ([30|, see |6| and the discussion after Theorem [T]below. Whereas for the Zolotarev approximation 
the regions of good approximation are concentrated along the real axis, the approximation gs(t) approaches sign(f) 
well on circles to the left and right of the imaginary axis, see the right plot in Figure[T] For this reason, the Neuberger 
approximation is better suited for generic non-Hermitian matrices. All we need is some a priori information on the 
spectrum from which we can determine an appropriate circle C in the right half-plane, centered on the real axis, 
such that C together with -C contains all the eigenvalues. We then can compute the degree s of the Neuberger 
approximation such that the sign function is approximated to a given accuracy on C U -C. 
The following theorem gives the insight necessary for this approach to work. 



Theorem 1. For given s and e > we have 

\es(t)\ = \gAt) - sign(f)| <£ forte C,,, or 



teC, 



(31) 



where Cs^^ is the circle with radius R = 26{e, s)/[6{€, s)^ — 1] and center M = [6(e,s)^ + l]/[S(e,s)^ — 1], with 
6(e,s) = (2/e+ 1) 



l/(2i) 



Proof. Assume that t is in the right half-plane (the case of f in the left half-plane can be treated in a completely 
analogous manner). With z-[it+ l)/(/ - 1)]^* we write gs(t) -{z - l)/(z + 1) such that Csit) - gs(t) - 1 = -2/{z + 1). 
Therefore \es{t)\ < e if and only if |z -H 1| > 2/e. 

Since |z| - 1 < |z + 1|, a sufficient condition for |e.5(f)l < e is |z| - 1 > 2/e, which is equivalent to 



t+ 1 



t- 1 



+ 1 



l/(2s) 



6(e, s) . 



(32) 



Let t = X + iyhe on the circle Cj_e, i.e., (x - Mf- + - R^. Then 



t+ 1 



t- 1 



(x+lf+y^ 
(x- 1)2 +/ 

(x+lf+R^ 



(x - MY 



(x-\Y +R'^ -(x- Mf 
2jt:(M+ 1)+ l+R^-M'^ 
2jc(M- 1) + 1 + /?2 " 



(33) 
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In fact we have I + R- - ^ I + (R - M)(R + M) = 1 - • = and thus 



1 +f 



1 -f 



6(€,s)+l 
5(e,s)- + l 



(34) 



A/ - 1 

i(E,.5)--l 



1 



So we have shown that \es(t)\ < e on the boundary of the circle C, f, and by the maximum modulus principle this also 
holds for t inside the circle. □ 



The parameter c in ( 29 1 can now be used in order to optimize the number of poles for a given target accuracy e if 
the spectrum of the operator is known to be contained in the union of two circles C(m, r) U C{-m, r), where C{m, r) is 
the circle {\z - m\ < r] and m is real, < r < m. For symmetry reasons it is again sufficient to discuss only the circle 
in the right half-plane, C(m, r). Note that i is a positive integer. Restricting the function gs{t) to real arguments, we 
see that it is positive on (0, oo), monotonically increasing on f e (0, 1], and that gs{t) - gsi^/t) as well as gs{l) = 1. 
The maximum error e„j.^^ - max,(=[,„_;.,„+r] |1 - gs{ct)\ is therefore smallest if c is chosen such that the scaled interval 
[c(m - r), c(m + r)] is of the form [l/d,d]. This is the case for c = ((m + r)(m - r))"'^^ with d = ((;« + r)/(m - r))'^^, 
see also |6 ||^ For this choice of c we see that t is in C(m,r) if and only if ct is in C(M,R) with M - ^"d 
R - But C(M,R) is precisely of the form that was considered in Theorem [l] with 6(e, s) - j^i. Therefore, if 
we want the error \gs(ct) - 1| to be smaller than e for f e C{m, r), Theorem [T| tells us that it is sufficient to require 
^ = 6{e, s) - (2/e + l)'/^^**. Solving for s we see that this precision is obtained if the number s of poles satisfies 

s > Ty-T-. (35) 

2-log(^) 

6. Numerical results 

This section contains the results of several numerical experiments comparing some of the methods developed in 
this paper. We only present results for Algorithms[T]and|4]since the results for Algorithms |2] and |3] are very similar to 
those of Algorithm [T] 1 20 1 . Algorithms [T] and |4] as described in Section]?] were applied to compute sign(Hw)b, where 
Hw = ysDwip) is the "Hermitian" Wilson Dirac operator at nonzero chemical potential and = (1, . . . , 1) for generic 
QCD gauge field configurations on lattices with sizes 4^, 6*, 8"^, and 10"*. The lattice parameters are/? = 5.1, mw = -2, 
nig - 0, and yU = 0.3, see |5l for the notation. 

In Algorithm]!] one has to decide which rational approximation to use. This decision should be made depending 
on the spectrum of A. Even though in lattice QCD the eigenvalues do not move far away from the real axis for 
reasonable values of fi, we adopted a conservative strategy and used the Neuberger approximation in our numerical 
experiments. As we discussed at the end of section ]5] in order to use a Neuberger rational approximation we have 
to determine circles C(m, r) and C{-m, r) which should contain all the eigenvalues (except the ones that have been 
deflated). Of course, we cannot precompute the whole spectrum, so we have to rely on a reasonable heuristics. From 
the deflation process we know a parameter a > such that all non-deflated eigenvalues have modulus larger than a. 
We also precomputed the eigenvalue which is largest in modulus with value /3 > 0. The heuristics, which is confirmed 
by additional numerical experiments, is to assume that for reasonable values of jj all eigenvalues are contained in 
the two circles centered on the real line and intersecting it at the points a, /3 and -a, -/3, respectively. This gives 



m = (a + 13)12 and r - (p - a)l2. The number of poles to use is now given by (35 i together with the corresponding 
(scaled) Neuberger approximation. Note that this approach is quite defensive since it allows eigenvalues to deviate 
substantially from the real axis if their real parts are not close to a, /3, -a or -/3. For larger lattice volumes and fi 
relatively small we observed that the Zolotarev approximation based on the intervals [-j3,-a] and [a,f5] can be an 
interesting alternative, since the spectrum deviates only marginally from the real axis. Using Zolotarev instead of 
Neuberger would reduce the computational cost for the restarted FOM method since the number of poles s would be 



'We thank an anonymous referee for pointing out that this discussion also shows that we could reduce the error from e^ax to emax/(2 - emax) if 
we multiplied gs(t) by a = 2/(2 - emax)- 
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reduced (since this moves the smallest shifts away from the origin, it also leads to a reduction in fetot)- However, as 
mentioned above, we only used the more conservative Neuberger approximation. 

In Algorithm [T] one also has to decide when the iteration to solve any of the linear systems is considered to 
be converged. We require the norms of the residuals to be less than e, with e the target accuracy of the rational 
approximation defined in Eq. ( [3T| l. This gives an upper bound of ■x, 2e on the total error In our experiments we 
observed that the total error (as defined in the next paragraph) was smaller (as small as O.le), which is natural since 
most of the eigenvalues are in the interior of the circles C(m, r) and C(-m, r), where the approximation works better 
than at the boundary. 

We now turn to the question of how to determine the accuracy of the approximations to the sign function in our 
numerical tests. The exact error cannot be determined because the computational cost to evaluate sign(A)fe exactly 
by a direct method is too large if A is large. To obtain an estimate for the error, we compute sign(A)^Z7 (by applying 
sign(A) twice in succession), which should equal b if the approximation to the sign function were exact, and then take 
5II sign(A)^Z7 - as a measure for the error (or accuracy). Of course, in production runs one would check the 

quality of the approximation only occasionally. 

In Figure |2] we compare the results of the restarted FOM-LR approximation with those of the direct two-sided 
Lanczos-LR method for various lattice sizes: 4"* (n = 3,072), 6"* (n = 15,552), 8"* (n = 49,152), and 10^ (n = 
120, 000), and for two different deflation gaps, i.e., the modulus of the smallest non-deflated eigenvalue "* ^ The accu- 
racy is shown as a function of the number of matrix-vector multiplications (left) and as a function of the CPU time on 
a 2.4 GHz Intel Core 2 with 8 GB of memory (right). Although the number of matrix- vector multiplications is often 
used to compare the efficiency of different iterative methods, it is not the best measure of the efficiency since it only 
includes the first term in Eqs. p6l ) and ( [28| ), respectively]^ The total run time is a better measure since it includes the 
other terms as well. Depending on the parameters actually used, some of these terms can be dominant or negligible. 
E.g., the 0{k^) term in the two-sided-Lanczos-LR method becomes dominant when the Krylov subspace grows. An- 
other example are the 0{mn) terms in both algorithms, which reflect the cost of using the deflated eigenvectors and 
which could be neglected in all cases we considered. Note that for the two smaller lattices a larger fraction of the 
problem fits in cache, which leads to a reduction of the run time. 

In Figure [3] we show how the efficiency of both methods scales with the volume. This figure should be interpreted 
with care. Since we used a constant deflation gap we expect the number of iterations (A;tot resp. k) to be approximately 
constant I^This would result in a contribution to the execution time which is linearly dependent on the volume, for both 
methods. However, there are several effects which obscure this linear dependence. For example, in the restarted FOM 
there is a dependence on ^max- In the direct two-sided Lanczos method the 0{k^) cost to compute f{Hk) dominates for 
small volumes. In addition, there are the cache effects already mentioned. 

The restarted FOM-LR method contains three tunable parameters: the deflation gap m, the number s of poles in 
the partial fraction expansion and the restart size k^^x, i e-, the maximal size of the Krylov subspace before restarting. 
Figure |4] shows the effect of the restart size on the CPU time used by the restarted FOM-LR method. Clearly, there is 
an optimal size which should be determined before performing production runs. The number of poles in the partial 
fraction expansion is chosen adequately to achieve the desired accuracy, and strongly depends on the deflation gap. In 
our numerical results the number of poles varied between 8 and 70. 



7. Conclusions 



At nonzero chemical potential, the overlap Dirac operator contains the sign function of the Wilson operator Hv/ — 
ysDw, which is non-Hermitian. The by far most expensive part when applying the overlap Dirac operator to a field 



■^The plots in Figure [2] are for a single configuration per volume. One might ask to what extent this configuration is typical. In the present 
context the main diff"erence between configurations lies in the magnitude of their smallest Dirac eigenvalues. The removal of the latter by deflation 
makes the configuration typical. 

'Note that the cost of deflation, i.e., the cost to compute the in critical eigenvalues and eigenvectors, is not included in these figures (and in the 
figures below) because it only needs to be paid once for each A. In the case of lattice QCD, sign(A)fo has to be computed for many dift'erent h in an 
iterative inverter. One should then choose m such that the total run time, including the cost of deflation (which strongly depends on the details of 
A), is minimized. However, this optimization issue is not the focus of the current paper. 

*Note that FOM-LR applied to Eq. (30} works with A", so we actually have C„ = 2M„. 

'This is not necessarily so for the small lattices, where the superlinear convergence of Krylov subspace methods might become noticeable. 
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#M*v CPU time 
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#M*v CPU time 



Figure 2: Comparison of the accuracy of the restarted FOM-LR algorithm (rFOM) and the direct two-sided Lanczos-LR method (2sL) as a function 
of the number of matrix-vector multiplications (left) and the CPU time in seconds (right) for a 4* (row 1), (row 2), 8^* (row 3), and 10"* (row 4) 
lattice configuration. Each plot shows data for two dift'erent deflation gaps, given in parentheses. The restart size used in the restarted FOM-LR 
algorithm is imax = 30 for the 4^* lattice, i:,„ax = 40 for the 6* and 8"* lattices and inmx = 50 for the 10** lattice. 
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dimension 



Figure 3: Run time (in seconds) for tlie restarted FOM-LR algoritiim and tlie direct two-sided Lanczos-LR metliod as a function of tlie matrix size 
to achieve an accuracy of 10"^. Tlie run time does not include the cost of deflation. The deflation gaps are given in parentheses, and the restart 
sizes are the same as in Figurepl The data point for 2sL(0.05) on a lO'' lattice (open circle) was computed in double-pass for memory reasons. 




20 40 60 80 100 120 140 20 40 60 80 100 120 140 

restart size restart size 



Figure 4: Dependence of the number of iterations and run time (in seconds) on the restart size for the restarted FOM-LR method to achieve an 
accuracy of 10"* for an 8"* configuration (left) and a 10* configuration (right). The deflation gap is 0.1 in both cases. 

vector b — the standard step in any iterative solver for the overlap Dirac operator — is the computation of the action 
of sign(Hw) on b. As a step towards developing computationally feasible methods for the dynamical simulation of 
overlap fermions at nonzero chemical potential, we proposed in this paper several short-recurrence Krylov subspace 
methods to efficiently compute sign(Hw)b. 

One class of methods is based on restarts of the Arnoldi process and requires a precise rational approximation for 
the sign function on the (complex) spectrum of the Wilson operator. This means that we need to have information 
on the location of the spectrum in the complex plane and that we have to adapt the number of poles in the rational 
approximation accordingly. The storage requirements for these methods depend on the restart value, a parameter 
which has to be tuned to be optimal, and the number of poles in the rational approximation. Storage does not depend 
on the number of iterations to be performed. 

The other class of methods relies on the two-sided Lanczos process. We can use a rational function approximation, 
in which case the comments made in the previous paragraph apply as well, except that there is no restart. Alternatively, 
the sign function can be evaluated directly. In that case, if a two-pass strategy is used, the storage requirements are 
minimal; otherwise storage increases linearly with the number of iterations. No a priori knowledge on the spectrum 
is required. If the number of iterations to be performed gets large, the work spent in evaluating the sign function 
of the projected operator, which is represented by a tridiagonal matrix, becomes decisive in terms of computational 
cost. Therefore, the methods based on a rational function approximation were faster in the numerical experiments 
that we performed on lattices with sizes ranging from 4"* to 10"*. However, fast methods are currently being developed 
to compute the sign of the projected tridiagonal matrix, which will speed up the direct two-sided Lanczos method 
substantially |26|. 

For both classes of methods the deflation of critical eigenvalues is an important ingredient towards efficiency. We 
showed that LR deflation is to be preferred to Schur deflation. 
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